Abstract

Background: DNA methylation serves as an important epigenetic mark in both eukaryotic and prokaryotic 
organisms. In eukaryotes, the most common epigenetic mark is 5-methylcytosine, whereas prokaryotes can have 
6-methyladenine, 4-methylcytosine, or 5-methylcytosine. Single-molecule, real-time sequencing is capable of 
directly detecting all three types of modified bases. However, the kinetic signature of 5-methylcytosine is subtle, 
which presents a challenge for detection. We investigated whether conversion of 5-methylcytosine to 
5-carboxylcytosine using the enzyme Tet1 would enhance the kinetic signature, thereby improving detection. 

Results: We characterized the kinetic signatures of various cytosine modifications, demonstrating that 
5-carboxylcytosine has a larger impact on the local polymerase rate than 5-methylcytosine. Using Tet1-mediated
conversion, we show improved detection of 5-methylcytosine using in vitro methylated templates and apply the 
method to the characterization of 5-methylcytosine sites in the genomes of Escherichia coli MG1655 and Bacillus 
halodurans C-125. 

Conclusions: We have developed a method that enhances the direct detection of 5-methylcytosine during
single-molecule, real-time sequencing. Using Tet1 to convert 5-methylcytosine to 5-carboxylcytosine improves the 
detection rate of this important epigenetic marker, thereby complementing the set of readily detectable microbial 
base modifications, and enhancing the ability to interrogate eukaryotic epigenetic markers. 

Keywords: Carboxylcytosine, DNA sequencing, epigenomics, methylation, methylcytosine, SMRT sequencing, Tet 
protein 



Background 

The DNA of most organisms is comprised of more than 
the four canonical bases (A, C, G and T). In mammals, for 
example, 5-methylcytosine (5mC) constitutes about 1% of 
all DNA bases and is found primarily in CpG dinucleo- 
tides. Methylation plays a critical role in the regulation of 
gene expression, genomic imprinting and the suppression 
of transposable elements [1]. Often referred to as the sixth 
base, 5-hydroxymethylcytosine (5hmC) is also found in 
many metazoan genomes [2]. 5hmC is converted from 
5mC by the Ten-eleven translocation (Tet) family of pro- 
teins [3,4] . Recently, it was discovered that Tet proteins 
can also convert 5mC to 5-formylcytosine (5fC) [5] and 5- 
carboxylcytosine (5caC) [6,7] . In humans, there are three 
different Tet proteins (Tet1, Tet2, Tet3) that are all capable
of this conversion [6,7]. It is currently thought that
DNA demethylation may occur through this process of 
5mC oxidation followed by base excision repair [6,8], and 
possibly decarboxylation [9] . 

Many of the genomes of bacteria and archaea also contain
modified DNA bases [10]. The three most common
forms of methylation are 6-methyladenine (6mA), 
4-methylcytosine (4mC) and 5mC. The primary function 
of methylation is DNA self-recognition via restriction- 
modification systems that protect the organism against 
invading DNA. However, there are methyltransferases 
(MTases), such as dam, that do not fall in restriction- 
modification systems and are important in chromosome 
stability, mismatch repair and replication [11]. There is 
some evidence that the presence of methylation can also 
impact gene expression [12]. Thus, detection and 
identification of methylated bases in both prokaryotes
and eukaryotes is essential to the complete understand- 
ing of genome function. 

The most common techniques for large-scale detec- 
tion of DNA methylation rely on bisulfite treatment of 
the DNA prior to sequencing. Sodium bisulfite chemi- 
cally deaminates cytosine residues to uracil, which are 
subsequently read out as thymine. Methylated cytosines 
are converted with much lower efficiency and thus 
remain cytosines. The presence of 5mC is inferred from 
comparing bisulfite-treated DNA sequences to an 
untreated reference. In standard bisulfite sequencing, 
5mC cannot be distinguished from 5hmC [13]. The conversion
of 5mC to 5caC through the activity of Tet1
[14] and 5hmC to 5fC through chemical conversion [15] 
followed by bisulfite sequencing runs has recently been 
exploited for the genome-wide sequencing of 5mC and 
5hmC. 
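
The bisulfite read-out logic described in this paragraph can be illustrated with a minimal in silico sketch. It is an idealized model that assumes complete conversion of unmodified cytosine and complete protection of methylated cytosine; the function names and the toy sequence are hypothetical and are not part of any published pipeline.

```python
def bisulfite_convert(sequence, methylated_positions):
    """Return the read-out of one strand after idealized bisulfite treatment.

    Unmodified C is deaminated to U and read as T. Cytosines at the given
    0-based positions are treated as protected and remain C.
    Assumption: 100% conversion and 100% protection.
    """
    protected = set(methylated_positions)
    return "".join(
        "T" if base == "C" and i not in protected else base
        for i, base in enumerate(sequence)
    )


def infer_methylated(reference, converted):
    """Positions where the untreated reference has C and the treated read still has C."""
    return [
        i
        for i, (ref_base, conv_base) in enumerate(zip(reference, converted))
        if ref_base == "C" and conv_base == "C"
    ]


# Toy example (hypothetical sequence): only the cytosine at index 2 is methylated.
reference = "ACCGGTCCAGG"
converted = bisulfite_convert(reference, methylated_positions=[2])
print(converted)
print(infer_methylated(reference, converted))
```

Because the model only records whether a cytosine is protected, a protected 5hmC would be reported in the same way as 5mC, which mirrors the limitation of standard bisulfite sequencing noted in the text.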

We have previously described a technique for the direct 
detection of modified DNA using single-molecule, real- 
time (SMRT®) sequencing [16,17]. SMRT sequencing 
involves the monitoring of a DNA polymerase as it makes 
a copy of a DNA molecule [18,19]. When the DNA poly- 
merase encounters a modified base on the template 
strand, its rate of progression changes in a characteristic 
way relative to an unmodified template with the same 
sequence context [16,17]. The speed of the polymerase is 
monitored by determining the length of time between the 
fluorescent pulses that indicate nucleotide incorporation. 
The time between pulses is called the interpulse duration 
(IPD). The change in IPD between a modified and control 
template varies in magnitude and position depending on 
the nature of the base modification and the local sequence 
context. We refer to these reproducible changes in IPD as 
the kinetic signature for that modification. 
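
As an illustration of how an IPD ratio could be computed per template position, the following is a minimal sketch. It assumes that IPD observations are available as (position, IPD) pairs for a modified and a control template; the data layout, function names and numeric values are assumptions made for the example and do not reproduce the processing used in this study.

```python
from collections import defaultdict
from statistics import mean


def mean_ipd_by_position(observations):
    """Average IPD (in seconds) per template position from (position, ipd) pairs."""
    by_position = defaultdict(list)
    for position, ipd in observations:
        by_position[position].append(ipd)
    return {position: mean(values) for position, values in by_position.items()}


def ipd_ratios(modified_obs, control_obs):
    """IPD ratio = mean IPD on the modified template / mean IPD on the control."""
    modified = mean_ipd_by_position(modified_obs)
    control = mean_ipd_by_position(control_obs)
    shared = sorted(modified.keys() & control.keys())
    return {pos: modified[pos] / control[pos] for pos in shared if control[pos] > 0}


# Toy data (hypothetical values): position 12 is slowed on the modified template.
modified = [(10, 0.4), (10, 0.6), (11, 0.2), (12, 1.0), (12, 1.4)]
control = [(10, 0.5), (10, 0.5), (11, 0.2), (12, 0.3), (12, 0.3)]
ratios = ipd_ratios(modified, control)
for position, ratio in ratios.items():
    print(position, round(ratio, 2))
```

A ratio near 1 indicates no kinetic difference from the control, whereas values well above 1 mark positions where the polymerase is slowed.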

Although many base modifications, such as 6mA, 4mC, 
5hmC and 8-oxo-guanine, are readily detectable in SMRT 
sequencing [16,17,20,21], the kinetic signature of 5mC is 
more subtle, requiring high sequencing fold coverage to 
make out the small effect on polymerase speed. The 
methyl group is small, and unlike for the case of 6mA and 
4mC, it is oriented towards the major groove and is not 
involved in base pairing - in fact the methyl group has to 
be readily accepted by DNA polymerases at this position 
as it is present on thymine, the other canonical pyrimidine 
base. We hypothesized that conversion of 5mC into a 
larger group may increase the magnitude of the kinetic 
signature during SMRT sequencing, thus enhancing the 
ability to detect 5mC. The Tet family of proteins carries 
out conversion of 5mC to several other modified forms of 
cytosine including 5hmC, 5fC and 5caC [6,7]. This strategy 
has been shown to be effective in the recently developed 
Tet-assisted bisulfite sequencing of 5hmC [14]. 



Here, we demonstrate that mouse Tet1 (mTet1) can
be used to enhance direct detection of 5mC during 
SMRT sequencing. Using synthetic templates made 
from oligonucleotides containing 5mC, 5hmC, 5fC or 
5caC modifications, we tested the kinetic signatures of 
each modification. We discovered that each of the moi- 
eties into which 5mC can be converted via Tet increased 
the magnitude of the kinetic signature, with 5caC having 
the largest effect. Next, we observed that oxidation of 
5mC to 5caC on either synthetic templates or in vitro 
methylated DNA enhanced our ability to detect posi- 
tions of 5mC. We then used our improved 5mC detec- 
tion method for the genome-wide characterization of 
MTase activities in two different bacterial strains. 

Results 

SMRT sequencing shows varying kinetic signatures for 
different cytosine modifications 

To determine the kinetic signatures for the four naturally 
occurring forms of cytosine with a modification on the 
fifth carbon atom, we designed synthetic SMRTbell tem- 
plates made from oligonucleotides with modified cytosines 
at specific template positions. Four modified synthetic 
SMRTbell templates were made, each containing two 
5mC, 5hmC, 5fC or 5caC modifications. The polymerase 
dynamics of each was analyzed by SMRT sequencing and 
compared with a control template of the same sequence 
but lacking the modifications. The kinetic signatures for 
each cytosine modification type are shown in Figure 1 as 
ratios of the average IPD value at each template position 
of the modified template relative to the unmodified con- 
trol. The positions of the modified bases are highlighted as 
red bars. As observed previously [17], the kinetic signature 
for 5mC is distinct from the background, but the magni- 
tudes of the IPD ratios are small, translating to relatively 
high sequencing coverage for detection of the modified 
positions with high confidence. Furthermore, the kinetic 
signature is spread out over multiple positions on the 
DNA template [17], likely due to effects of base modifica- 
tions on the polymerization rate extending across the 
entire footprint of the polymerase [22] . 

As the size of the chemical structure of the modification 
increases, the magnitude of the kinetic signature also 
increases. The IPD ratio peaks are approximately
two-fold for 5mC and range from approximately three-fold to more
than five-fold for 5fC and 5caC (Figure 1). For each modification
type, an extended signature consisting of multiple
IPD ratio peaks was observed, with the most prominent 
signals at positions 0, +2 and +6 relative to the polymerase 
movement, with 0 being the position of the modification 
in the template. In most instances investigated here, the 
+2 peak was the most pronounced. As previously observed 
[16,17], the kinetic signatures for a given modification 
Figure 1 Kinetic signals from SMRT sequencing for the four epigenetic markers of cytosine 5mC, 5hmC, 5fC and 5caC. Synthetic
oligonucleotides carrying two modified sites each (red bars) were subjected to SMRT sequencing and the polymerase kinetics compared by
plotting the ratio of IPDs for each template position against a control template of identical sequence but lacking the modifications. The
template is shown in the 5' to 3' direction from left to right, the polymerase movement is right to left across the template as indicated by the
arrow. 5caC: 5-carboxylcytosine; 5fC: 5-formylcytosine; 5hmC: 5-hydroxymethylcytosine; 5mC: 5-methylcytosine; IPD: interpulse duration; SMRT:
single-molecule, real-time.



varied slightly depending on the surrounding sequence 
context. These differences in the pattern and magnitudes 
of the kinetic signatures for each of the four different 
modifications are a parameter that can be used to discri- 
minate between different modifications on the same DNA 
template, although they are not used in the current imple- 
mentation of the software. To further explore the effects 
of local sequence context on the kinetic signatures of 5mC 
and 5caC, we used a synthetic SMRTbell template that 
contained a modified base in a 5'-CG-3' sequence context, 
surrounded by two random bases on each side. Additional 
file 1 shows a heat map of IPD ratios for the 256 possible 
sequence contexts at each position from -3 to +6 relative 
to the modified position in the template. As observed pre- 
viously [17,20], the magnitude and position of the kinetic 
signals for both 5mC and 5caC are dependent upon the 
surrounding sequence context. The conversion of 5mC to 
5caC enhances the magnitude of the IPD ratio at each 
position where ratios above 1.0 are observed for 5mC, that 
is, positions 0, +2, and +6, and brings out an additional 
detectable signal at the -2 position for some sequence 



contexts. Tet conversion enhances the kinetic signals rela- 
tively evenly across all sequence contexts, which is appar- 
ent from the good preservation of the overall sequence 
context profiles. We are currently investigating possible 
additional correlations that could exist between different 
base positions in a given context. This could aid in the 
development of more refined identification algorithms. 
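
Two random bases on each side of the 5'-CG-3' core give 4^4 = 256 possible sequence contexts. The sketch below enumerates these contexts and averages IPD ratios per context and offset, which is the kind of table a heat map could be drawn from. The record layout (context, offset, ratio) and the function names are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import product
from statistics import mean


def all_contexts(flank=2, core="CG"):
    """Enumerate every template of the form N*flank + core + N*flank."""
    for bases in product("ACGT", repeat=2 * flank):
        left = "".join(bases[:flank])
        right = "".join(bases[flank:])
        yield left + core + right


def mean_ratio_by_context(records):
    """records: iterable of (context, offset, ipd_ratio) tuples.

    Returns {(context, offset): mean IPD ratio}.
    """
    grouped = defaultdict(list)
    for context, offset, ratio in records:
        grouped[(context, offset)].append(ratio)
    return {key: mean(values) for key, values in grouped.items()}


contexts = list(all_contexts())
print(len(contexts))

# Toy records (hypothetical values) for one context at the +2 offset.
table = mean_ratio_by_context([("AACGAA", 2, 2.0), ("AACGAA", 2, 4.0)])
print(table)
```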

Enhanced detection of 5mC following conversion to 
5caC by mTet1

Because 5caC has the largest kinetic signature, conver- 
sion of 5mC to 5caC should significantly improve the 
ability to detect 5mC in SMRT sequencing. The Tet 
family of proteins has been shown to convert 5mC to 
5caC in mammalian genomes [6,7]. This conversion can 
be over 97% for sequencing purposes and does not exhi- 
bit significant sequence context bias [14]. We tested the 
ability of Tet1-mediated oxidation of 5mC to 5caC to
enhance direct detection on in vitro methylated DNA 
templates, described in detail in Methods. Briefly, we first 
generated an approximately 6-kb plasmid by inserting a 
lambda DNA fragment into the pCRBlunt vector and
subjected it to whole genome amplification (WGA) to 
erase all modifications. We then generated an approxi- 
mately 500 bp randomly sheared shotgun SMRTbell tem- 
plate library from the WGA material, followed by in vitro 
methylation using the HpaII MTase that modifies the
internal cytosine in 5'-CCGG-3' sequence contexts. 
Considering both the forward and reverse DNA strands, 
the plasmid sequence contains 70 instances of the 5'- 
CCGG-3' sequence motif. Methylated positions within 
the SMRTbell templates were converted to 5caC by treatment
with the Tet1 enzyme. In vitro methylated (5mC),
Tet1-converted (5caC) and WGA control (no modification)
libraries were then subjected to SMRT sequencing.
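
Locating the modified base of a recognition motif on both DNA strands could be sketched as follows. The helper names and the toy sequence are illustrative assumptions (the sequence is not the plasmid used here), and only the IUPAC codes needed for the motifs in this paper are included.

```python
import re

# Regular-expression classes for the IUPAC codes used in this paper's motifs.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "W": "[AT]", "S": "[CG]", "N": "[ACGT]"}
COMPLEMENT = str.maketrans("ACGTWSN", "TGCAWSN")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def motif_sites(sequence, motif, modified_index):
    """Return sorted (strand, position) pairs for the modified base of each motif hit.

    Positions are 0-based forward-strand coordinates. modified_index is the
    0-based index of the modified base within the motif.
    """
    n = len(sequence)
    # A lookahead lets overlapping motif instances be reported as well.
    pattern = re.compile("(?=(" + "".join(IUPAC[b] for b in motif) + "))")
    sites = []
    for strand, seq in (("+", sequence), ("-", reverse_complement(sequence))):
        for match in pattern.finditer(seq):
            pos = match.start() + modified_index
            sites.append((strand, pos if strand == "+" else n - 1 - pos))
    return sorted(sites)


toy = "TTCCGGAACCAGGTT"  # hypothetical sequence with one CCGG and one CCWGG site
print(motif_sites(toy, "CCGG", 1))
print(motif_sites(toy, "CCWGG", 1))
```

Because 5'-CCGG-3' is palindromic, every site yields one hit per strand, which is why counts that consider both strands are twice the number of double-stranded sites.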

Figure 2 shows the plasmid-wide view of IPD ratio 
data for the in vitro methylated (Figure 2a) and the 
mTet1-converted (Figure 2b) templates relative to the
unmodified control. The IPD ratios for the 5mC-modified
templates are visible as small excursions from the
background (Figure 2a). Following mTet1-mediated oxidation
to 5caC (Figure 2b), the kinetic signature was
enhanced by an average of approximately 4.6-fold, mak- 
ing all 35 instances of the MTase recognition motif 
recognizable as large excursions in the IPD ratio. The 
primary IPD ratio peaks for the 5caC sample again fell 
at the +2 position relative to the modification, consistent 
with the results obtained from the synthetically derived 
samples. Similar results were obtained with synthetic 
SMRTbell templates that were made with oligonucleo- 
tides containing 5mC modifications and subjected to 
conversion by mTet1 (Additional file 2).

Genome-wide analysis of 5mC DNA methyltransferase 
specificities in bacterial strains 

Most bacterial and archaeal genomes contain DNA
MTases. Many of these MTases are paired with restric- 
tion endonucleases as part of a restriction-modification 
system that protects the organism from foreign DNA 
[23]. These MTases typically methylate a specific 
sequence context, which blocks the activity of the 
restriction enzyme that recognizes the same site. The 
three most common types of methylation found in bacteria
and archaea are 6mA, 4mC and 5mC. To test the
ability of the mTet1-enhanced signal to detect 5mC in
genomic DNA, we selected two bacterial strains that are 
known to express a 5mC MTase [24]. 

Escherichia coli K12 MG1655 is a well-studied, com- 
mon laboratory strain that is known to express three 
different MTases. EcoKdam is a 6mA MTase that modi- 
fies the adenosine in a 5'-GATC-3' sequence context 
(methylated base underlined). EcoKI is a type I MTase 
that modifies the sequence context 5'-GCAC(N6)GTT- 
3' and reverse complement 5'-AAC(N6)GTGC-3'. The 
5mC MTase is EcoKdcm that modifies the internal 



cytosine in a 5'-CCWGG-3' sequence context, where W is either an A or
a T. We made SMRTbell templates from randomly
sheared E. coli K12 MG1655 genomic DNA, a portion
of which was sequenced in its native form and another
portion of which was subjected to the mTet1 treatment.
Both samples were sequenced to approximately 150-fold
coverage per DNA strand.

We carried out an unbiased search for sequence motifs 
that were enriched in proximity to genomic positions 
with large excursions from the expected IPD values (see 
Methods for details). For the native sample we identified 
the expected 5'-GATC-3' and the 5'-GCAC(N6)GTT-3' 
and/or 5'-AAC(N6)GTGC-3' sequence motifs, but 
observed low signal levels for the 5'-CCWGG-3' motif. 
However, following the mTet1 conversion, we were able
to identify the majority of 5'-CCWGG-3' motifs in the
genome as modified. Figure 3 compares IPD ratio data
over the entire E. coli genome before and after mTet1
treatment. As expected, IPD ratio data for sites methylated
with 6mA did not change between the native and
Tet1-converted samples (panel a, grey lines). By contrast,
IPD ratio data for the +2 position of the 5'-CCWGG-3'
sites (panel a, red lines) were significantly increased in
the mTet1-treated sample, thereby improving detection
of dcm-mediated methylated positions, with IPD ratio
magnitudes now similar to 6mA signals. The distributions
of IPD ratios for all methylated motifs are included
in Additional file 3.

To estimate the degree of enhancement in 5mC detection
by mTet1 treatment (Table 1), we selected the 99th
percentile kinetic score of an off-target motif
(5'-GGWCC-3') as the threshold for calling a genomic position
as methylated (Figure 3c). Any kinetic score that was
greater than this value was considered modified. In the
native sample, only 455 (1.9%) of all genomic 5'-CCWGG-3'
positions were detected above this background value.
Upon conversion of 5mC to 5caC in the mTet1-treated
sample, 22,913 genomic 5'-CCWGG-3' positions (95.2%)
were detected as methylated. The off-target site was unaffected
by the mTet1 treatment, highlighting the specificity
of the mTet1 conversion to methylated DNA sites. Additional
file 4 shows the detection rate for all modified
sequence motifs, including 6mA. This table also enumer- 
ates detection levels of additional off-target sequences that 
exhibit a consistently low percentage of sites above the 
detection threshold. 
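
The thresholding step can be sketched as follows: take the 99th percentile of the kinetic scores at an off-target motif and count the on-target positions that exceed it. The nearest-rank percentile definition, the function names and the toy scores are assumptions made for illustration; the exact percentile convention used in the analysis is not stated in the text.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of values <= it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def detection_rate(on_target_scores, off_target_scores, q=99):
    """Return (threshold, number detected, percent detected) for the on-target motif."""
    threshold = percentile(off_target_scores, q)
    detected = sum(score > threshold for score in on_target_scores)
    return threshold, detected, 100 * detected / len(on_target_scores)


# Toy scores (hypothetical values).
off_target = list(range(1, 101))
on_target = [50, 99, 100, 150, 200]
print(detection_rate(on_target, off_target))

# The percentages reported for E. coli follow directly from the counts in the text.
print(round(100 * 455 / 24079, 1), round(100 * 22913 / 24079, 1))
```

Setting the threshold from an off-target motif means that, by construction, about 1% of unmodified positions of similar sequence content would be called as methylated.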

We performed the same procedure for B. halodurans 
C-125, a bacteriocin-producing soil bacterium. The 
B. halodurans genome is predicted to have three different 
MTases [24], including one MTase that has the hallmarks 
of a 5mC-modifying enzyme. However, unlike for the 
E. coli sample, the exact sequence motifs and positions of 
the modifications are not known. Through SMRT 
sequencing, we were able to identify two methylated 



Clark et al. BMC Biology 2013, 11:4
http://www.biomedcentral.com/1741-7007/11/4




[Figure 2 graphs: two base-resolution panels, x-axis "Template position" from 3850 to 3930]

Figure 2 Enhanced detection of 5mC through Tet1 oxidation using in vitro methylated plasmid samples. The Circos plots show the
plasmid-wide view of IPD ratios for (a) untreated and (b) mouse Tet1-treated samples, with the outer and inner circle denoting the forward and
reverse DNA strands, respectively, and the blue tick marks denoting all positions of the targeted M.HpaII in vitro methylated sequence motif of
5'-CCGG-3' (methylated base underlined). The bracket denotes a section of the plasmid which is shown at base resolution in the bottom graphs,
containing three instances of the methylated motif (grey boxes). The methylated positions are highlighted in red. IPD: interpulse duration.



sequence motifs: 5'-GCATC-3' or 5'-GATGC-3' and 5'- 
GGCC-3'. The first motif had high IPD ratio values on 
the A position on both forward and reverse strands, 
which is indicative of 6mA. This signal was present in 
both native and mTet1-treated samples (Figure 4). The
5'-GGCC-3' motif was considerably stronger in the
mTet1-treated sample, with the strongest peak on the
first G in the motif. Using the +2 pattern of the converted
5caC signature, the most likely modified base is the inner
C in the 5'-GGCC-3' motif. We detected 4.3% of 5'-GGCC-3'
motifs without mTet1 treatment, increasing to
76.7% following the enhancement of the 5mC signal by
mTet1 conversion (Table 1). The distributions of IPD
ratios for all methylated motifs in the B. halodurans gen- 
ome are shown in Additional file 3 and the detection rate 
data are presented in Additional file 5. 



Discussion 

In SMRT sequencing, modified bases in the DNA tem- 
plate are identified by the transient slowing of the DNA 
polymerase at and around the site of the modification. 
We previously demonstrated the detection of 5mC and 
5hmC through such kinetic analysis [17]. Here, we 
extend the spectrum of detectable base modifications to 
the full complement of currently known modified forms 
of cytosine. Both 5fC and 5caC showed an increased 
interference with polymerase movement compared with 
5mC, resulting in stronger kinetic signals in SMRT 
sequencing. In addition to the increased size of the modi- 
fication, the higher polarity of the formyl and carboxyl 
group could also contribute to the increased signal levels. 

In this work, we describe improving the direct SMRT 
sequencing of 5mC via mTet1-mediated oxidation to
Figure 3 Genome-wide 5mC methyltransferase specificity detection in E. coli K12 MG1655. The Circos plots show the genome-wide view
of IPD ratios for (a) untreated and (b) mTet1-treated samples, with the outer and inner circle denoting the forward and reverse DNA strands,
respectively. The IPD ratios of the +2 position in 5'-CCWGG-3' sequence contexts are plotted in red, while IPD ratio data for all other contexts is
plotted in grey. The graphs on the right show base-resolution IPD ratio views of a section of the genome containing one target site for adenine
methylation by dam (5'-GATC-3') and one target site for cytosine methylation by dcm (5'-CCWGG-3'). (c) Kinetic score distributions before and
after mTet1 conversion for all +2 positions of 5'-CCWGG-3' in the genome. An orthogonal off-target motif (5'-GGWCC-3') is also shown which was
used to set a 1% false discovery rate threshold (dashed line, see Methods for details) for tabulation of detected methylated positions (Table 1). The
blue tick marks in the Circos plots of (a) and (b) denote 5'-CCWGG-3' genomic positions detected as methylated using that threshold. 5mC:
5-methylcytosine; 6mA: 6-methyladenine; IPD: interpulse duration; mTet1: mouse Tet1.



5caC, thereby reducing the relatively high sequencing 
coverage required to detect the subtle signals imparted 
by 5mC with high confidence. mTet1 efficiently converted
5mC to 5caC in synthetic oligonucleotides, in
vitro methylated plasmids, bacterial genomic DNA and 
mammalian genomic DNA [14], facilitating identifica- 
tion of microbial 5mC MTase specificities, thus comple- 
menting the other two common, readily detectable 
bacterial methylation marks of 6mA and 4mC described



previously [16,17]. The protocol is rapid and specific to 
5mC, allowing all three base modifications to be simul- 
taneously detected in a single sequencing experiment. 
We anticipate that, for the sequencing of bacterial and 
archaeal genomes, such comprehensive characterization
of the methylome, in addition to de novo assembly of 
the genome [25,26], will improve our understanding of 
important microbiological phenomena, such as adapta- 
tion, pathogenicity and resistance evolution. It has been 



Table 1 Detection of 5mC in native versus mTet1-enhanced SMRT sequencing for the bacterial genomes

Sample               Treatment  Methylation motif  Number in genome  Number detected  Number detected (%)  Number unassigned (%) a
E. coli MG1655       native     CmCWGG             24,079            455              1.9                  0.4
E. coli MG1655       Tet1       CmCWGG             24,079            22,913           95.2                 0.3
B. halodurans C-125  native     GGmCC              15,207            660              4.3                  0.6
B. halodurans C-125  Tet1       GGmCC              15,207            11,663           76.7                 0.5

a Unassigned is the percentage of genomic positions that have kinetic scores above the cutoff but are not in a methylated motif or a secondary peak.
Figure 4 Genome-wide 5mC methyltransferase specificity detection in B. halodurans. The Circos plots show the genome-wide view of IPD
ratios for (a) untreated and (b) mTet1-treated samples, with the outer and inner circle denoting the forward and reverse DNA strands,
respectively. The IPD ratios of the +2 position in 5'-GGCC-3' sequence contexts are plotted in red, while IPD ratio data for all other contexts is
plotted in grey. The graphs on the right show base-resolution IPD ratio views of a section of the genome containing two target sites for
cytosine methylation (5'-GGCC-3'). (c) Kinetic score distributions before and after Tet1 conversion for all +2 positions of 5'-GGCC-3' in the
genome. An orthogonal off-target motif (5'-CCGG-3') is also shown which was used to set a 1% false discovery rate threshold (dashed line, see
Methods for details) for tabulation of detected methylated positions (Table 1). The blue tick marks in the Circos plots of (a) and (b) denote
5'-GGCC-3' genomic positions detected as methylated using that threshold. 5mC: 5-methylcytosine; 6mA: 6-methyladenine; IPD: interpulse
duration; mTet1: mouse Tet1.



demonstrated through bulk biochemical and genetic stu- 
dies that the dynamics of methylation in bacteria plays 
critical roles in basic cellular functions as well as directly 
affecting virulence [11,12,27]. 

The kinetic signatures of 4mC and 5caC are suffi- 
ciently different to allow for discrimination of the two 
types of cytosine modifications in bacteria. When 
sequencing through 4mC, the polymerase slows down 
only when incorporating the cognate nucleotide oppo- 
site the modification, with no significant secondary IPD 
ratio peaks [16]. By contrast, the primary IPD ratio peak 
for 5caC is located two bases after the modification (+2 
position). The combination of observing the sequence 
identity and the specific kinetic signature make it possi- 
ble to not only discover the presence of a base modifica- 
tion but also to determine the chemical identity of the 
type of modification. We are working on algorithmically 



harnessing this information contained in the kinetic sig- 
natures to expand the power of direct detection of mod- 
ified bases unique to SMRT sequencing [28]. Algorithms 
that incorporate IPD data from multiple positions across 
the entire footprint of the polymerase may further 
enhance the ability to detect and discriminate between 
modification types. This multi-site analysis and a further 
understanding of the sequence context dependence of 
the 5caC kinetic signature should improve detection of 
5caC, potentially reducing the sequencing coverage 
needed to detect converted 5mC positions even further. 
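
The difference in peak position between 4mC and 5caC suggests a simple rule of thumb, sketched below. This is a toy heuristic, not the detection algorithm under development: the `min_ratio` cutoff, the labels and the example values are hypothetical, and a practical classifier would need to account for sequence context and secondary peaks.

```python
def primary_peak_offset(ratios_by_offset):
    """Offset (relative to the cytosine) with the largest IPD ratio."""
    return max(ratios_by_offset, key=ratios_by_offset.get)


def guess_cytosine_modification(ratios_by_offset, min_ratio=2.0):
    """Label a cytosine from the position of its primary IPD ratio peak.

    min_ratio is a hypothetical cutoff below which no call is made.
    """
    offset = primary_peak_offset(ratios_by_offset)
    if ratios_by_offset[offset] < min_ratio:
        return "unmodified or undetected"
    if offset == 0:
        return "4mC-like"  # slowing opposite the modified base itself
    if offset == 2:
        return "5caC-like"  # primary peak two bases after the modification
    return "unclassified"


# Toy IPD ratio profiles (hypothetical values), keyed by offset from the cytosine.
print(guess_cytosine_modification({-2: 1.1, 0: 5.2, 2: 1.3, 6: 1.0}))
print(guess_cytosine_modification({0: 1.8, 2: 6.0, 6: 2.5}))
print(guess_cytosine_modification({0: 1.1, 2: 1.3}))
```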

In higher eukaryotes, the epigenome is much more 
complex as at least four different forms of cytosine can 
occur and dynamically interconvert at epigenetically 
regulated genomic positions. Emerging evidence sug- 
gests that the Tet proteins and the modified cytosines 
they generate are crucial for a growing list of biological 
processes, including zygotic epigenetic reprogramming,
pluripotent stem cell differentiation, hematopoiesis and
development of leukemia [2]. Thus, methods for comprehensive
genome-wide mapping of all cytosine modifications
will be critical for epigenomic studies. Several
methods have been described recently for discriminating 
between 5mC and 5hmC using bisulfite sequencing in 
combination with chemical or enzymatic conversion 
[14,15]. Since for a given sequence context in SMRT 
sequencing, the kinetic signatures of 5mC, 5hmC, 5fC, 
and 5caC are different, there is the potential for direct 
identification of the various modifications on native 
DNA samples. We are working to expand the bioinfor- 
matics analysis algorithms towards discrimination of 
different epigenetic marks, taking into account the dif- 
ferent signatures as a function of sequence context, as 
well as partial modification and mixtures of modification 
types. There are already several strategies for enhancing 
the kinetic signature of two cytosine modifications 
allowing for direct detection of 5mC and 5hmC in a sin- 
gle sample using SMRT sequencing. 5hmC positions can 
first be glucosylated [21], followed by Tet1-mediated
oxidation of 5mC to 5caC. Glucosylated 5hmC will be 
protected from conversion and discrimination of the 
two forms can be made based on the differing kinetic 
signatures. We expect that these and further advances 
in the direct detection of modified bases during routine 
genome sequencing will become an important tool to 
further our understanding of genome and epigenome 
function. 

Methods 

Materials 

Custom oligonucleotides containing modified bases were 
synthesized on-site or purchased from Trilink Bio- 
Technologies (San Diego, CA, USA) and Integrated 
DNA Technologies (Coralville, IA, USA). All oligonu- 
cleotides contained 5' phosphate groups. The plasmid 
(pCRBlunt) was obtained from Life Technologies (Carls- 
bad, CA, USA). A list of the sequences can be found in 
Additional file 6. 

Bacterial strains and/or genomic DNA from bacterial 
strains were purchased from the American Type Culture 
Collection (Manassas, VA, USA). The following strains 
were used in this study: E. coli K12 MG1655, and 
B. halodurans C-125 (JCM 9153). 

SMRTbell template preparation 

Synthetic SMRTbell templates were made as previously 
described by ligating several synthetic oligonucleotides 
[20]. For plasmid and genomic DNA samples, an aliquot 
of approximately 25 ng of DNA was subjected to WGA 
using the REPLI-g Midi Kit (Qiagen, Valencia, CA, 
USA). WGA and native DNA was sheared to an average 



size of approximately 500 bp via adaptive focused acous- 
tics (Covaris, Woburn, MA, USA). SMRTbell template 
sequencing libraries were prepared as previously 
described [16,29]. SMRTbell libraries made from whole- 
genome-amplified pCRBlunt-6K plasmid were in vitro 
methylated using the HpaII MTase (recognition
sequence: 5'-C5mCGG-3'; New England BioLabs, Ipswich,
MA, USA) as per the manufacturer's instructions.
Complete methylation was assessed by modifying
lambda DNA in parallel and subjecting it to methylation-sensitive
restriction using the HpaII restriction enzyme
(New England BioLabs). 

Tet1 conversion 

The 5mC modifications in SMRTbell template libraries 
were converted to 5caC using the 5mC mTet1 Oxidation
Kit from Wisegene (Chicago, IL, USA) as per the manufacturer's
instructions. Approximately 500 ng of SMRTbell
templates were treated with the Tet1 enzyme at 37°C for
60 minutes followed by proteinase K treatment at 50°C for 
60 minutes. Converted SMRTbell templates were purified 
using Micro Bio-Spin 30 Columns (BioRad, Hercules, CA, 
USA) with additional purification and concentration using 
MinElute PCR Purification Columns (Qiagen). 

Sequencing and data acquisition 

SMRTbell templates were subjected to standard SMRT 
sequencing, as described [18,19]. Reads were processed 
and mapped to the respective reference sequences using 
the BLASR mapper [30] and Pacific Biosciences' SMRT 
Analysis pipeline [31] using the standard mapping proto- 
col. IPDs were measured as previously described [17] and 
processed as described [16] for all pulses aligned to each 
position in the reference sequence. 

For the bacterial methylome analysis [10], we used 
Pacific Biosciences' SMRTPortal analysis platform v. 
1.3.1, which uses an in silico kinetic reference and a t-test
based detection of modified base positions [32]. The fol- 
lowing GenBank reference sequences were used: 
U00096.2 for E. coli K-12 MG1655 and BA000004.3 for 
B. halodurans C-125. MTase target sequence motifs were 
identified by selecting the top 1,000 kinetic hits and sub- 
jecting a ±20 base window around the detected base to 
MEME-ChIP [33], and compared to the predictions in 
REBASE [24] . To estimate the enhancement of detection 
of methylated 5mC positions (Table 1), we first selected 
an orthogonal off-target motif of similar sequence con- 
tent and calculated the kinetic score representing the 
99th percentile of all genomic positions of that motif
(5'-GGWCC-3' for E. coli (score threshold = 35.6);
5'-CCGG-3' for B. halodurans (30.4)). We then used this
1% false positive detection threshold for determining the 
number of genomic positions of the on-target methyla- 
tion sites detected as methylated (Figures 3c and 4c; 
Table 1). IPD ratio plots were visualized using Circos
[34]. 
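
Preparing the input for motif discovery, that is, a ±20 base window around each of the top-scoring positions, could be sketched as follows. The function name, the FASTA header format and the toy scores are assumptions made for illustration. The sketch ignores strand and clips windows at the ends of a linear sequence, which is a simplification for circular bacterial chromosomes.

```python
def top_hit_windows(sequence, scores, n_hits=1000, flank=20):
    """Return FASTA text with a +/-flank window around the highest-scoring positions.

    scores maps a 0-based position to its kinetic score. Windows are clipped
    at the sequence ends (linear-sequence simplification).
    """
    top = sorted(scores, key=scores.get, reverse=True)[:n_hits]
    records = []
    for pos in sorted(top):
        start = max(0, pos - flank)
        end = min(len(sequence), pos + flank + 1)
        records.append(f">hit_{pos}_score_{scores[pos]}\n{sequence[start:end]}")
    return "\n".join(records) + "\n"


# Toy input (hypothetical sequence and scores): keep the two best hits.
toy_sequence = "ACGT" * 25
toy_scores = {50: 40.0, 10: 35.0, 70: 5.0}
fasta = top_hit_windows(toy_sequence, toy_scores, n_hits=2, flank=20)
print(fasta)
```

The resulting FASTA text can be written to a file and submitted to a motif discovery tool.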

Additional data files 

The following additional data are available with the 
online version of the paper. Additional data file 1 is a 
figure that demonstrates the sequence context depen- 
dence of the kinetic signatures for 5mC and 5caC. Addi- 
tional data file 2 is a figure that shows IPD ratio data for 
synthetic SMRTbell templates before and after conver- 
sion of 5mC to 5caC. Additional file 3 is a figure with 
IPD ratio distributions for all methylated sequence 
motifs in E. coli and B. halodurans. Additional files 4 
and 5 are tables that contain detection rate information 
for all methylated motifs in E. coli and B. halodurans, 
respectively. Additional data file 6 is a table of oligonu- 
cleotide sequences used in this study. 

Additional material 



colored: 6mA (red), 5mC (blue). The interrogated base in the motif is 
underlined. Unassigned are genomic positions with kinetic scores above 
the cutoff which are not in a methylated motif or a secondary peak. 

Additional file 6: Table of oligonucleotide sequences used in this 
study 